1 Part 1: Association Rule Mining

1.1 Introduction

Association rule mining, often also referred to as ‘Market Basket Analysis’, is a data mining method used to find useful insights in a particular domain. It is a rule-based machine learning method designed to discover frequently co-occurring associations among a collection of items or transactions.

An association rule has two parts: an antecedent and a consequent (the item that is found in combination with the antecedent).

The strength of an association rule can be measured in terms of its support and confidence. Support tells us how frequently the items, or the combination of antecedent and consequent, appear in the database. This is of interest because a rule with very low support is likely to be uninteresting from a business perspective. Confidence, on the other hand, indicates the probability that a transaction containing the antecedent also contains the consequent. It essentially measures the reliability of the inference made by a rule.

Another interesting evaluation metric is lift, a correlation indicator that shows how often a rule occurs compared to how often it would be expected to occur by chance. If the lift value is equal to 1 it means that the items are independent of each other, whereas lift > 1 tells us the degree to which the items are dependent on each other.
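To make these three metrics concrete, here is a small sketch that computes them by hand on a hypothetical mini-dataset (the item names and counts are illustrative only, not taken from the bakery data):

```r
# Toy transactions to illustrate support, confidence and lift
# for the hypothetical rule {coffee} -> {croissant}
transactions <- list(
  c("coffee", "croissant"),
  c("coffee", "croissant", "juice"),
  c("coffee"),
  c("juice", "croissant"),
  c("coffee", "croissant")
)
n <- length(transactions)

# Support of the antecedent, the consequent, and the full rule
supp_A  <- sum(sapply(transactions, function(t) "coffee" %in% t)) / n     # 0.8
supp_B  <- sum(sapply(transactions, function(t) "croissant" %in% t)) / n  # 0.8
supp_AB <- sum(sapply(transactions,
                      function(t) all(c("coffee", "croissant") %in% t))) / n  # 0.6

confidence <- supp_AB / supp_A   # P(croissant | coffee) = 0.75
lift       <- confidence / supp_B  # 0.9375, i.e. slightly below independence
```

With a lift below 1 this toy rule would actually indicate a (weak) negative association; the arules functions used later compute exactly these quantities for every candidate rule.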

1.2 Data

For this practice we are using the Extended Bakery Dataset containing 75,000 receipts from a bakery chain that has a menu of about 40 pastry items and 10 coffee drinks. The database stores information about the food/drinks offered for sale, locations, employees at each location and individual sales (receipts) at those locations.

The aim is to find frequent itemsets and interesting association rules.

Let’s start by loading the data files.

  1. Description File:
items = read.csv("/Users/kelly/Documents/MADM/CURSOS/N. Te. Mi. Da./Part 3/Practica Final/BAKERY DATASET-20190525/bakerygoods.txt", sep = "\t", stringsAsFactors = FALSE)
head(items)
  id        name  type price category
1  0  Chocolate  Cake   8.95     Food
2  1      Lemon  Cake   8.95     Food
3  2     Casino  Cake  15.95     Food
4  3      Opera  Cake  15.95     Food
5  4 Strawberry  Cake  11.95     Food
6  5    Truffle  Cake  15.95     Food
  2. Transaction data
# Get max number of items in one transaction
max_items = max(count.fields("/Users/kelly/Documents/MADM/CURSOS/N. Te. Mi. Da./Part 3/Practica Final/BAKERY DATASET-20190525/75000/75000-out1.csv", sep = ","))
# Read data
data = read.table("/Users/kelly/Documents/MADM/CURSOS/N. Te. Mi. Da./Part 3/Practica Final/BAKERY DATASET-20190525/75000/75000-out1.csv", sep = ",", header = FALSE, fill = TRUE,
                     col.names = c("id", paste0("item",1:(max_items-1))))
head(data)
  id item1 item2 item3 item4 item5 item6 item7 item8
1  1    11    21    NA    NA    NA    NA    NA    NA
2  2     7    11    37    45    NA    NA    NA    NA
3  3     3    33    42    NA    NA    NA    NA    NA
4  4     5    12    17    47    NA    NA    NA    NA
5  5     6    18    42    NA    NA    NA    NA    NA
6  6     2     4    34    NA    NA    NA    NA    NA

We can see that each row represents a transaction. The first column states the receipt ID and the remaining columns the item IDs that were sold in that receipt. We can remove the ID column since it does not hold any useful information. Furthermore, we want to replace the item IDs with the corresponding item names found in the description table. To do so we need to combine the name and type labels to get the different unique items.

# Remove ID column
data = data[,(names(data) != "id")]

# Match item-IDs with item names
items$item = paste(items$name, items$type)
data = data.frame(lapply(data, function(x) x = items$item[match(x, items$id)]))
head(data)
              item1             item2                item3
1        Apple Pie    Ganache Cookie                  <NA>
2    Coffee Eclair         Apple Pie         Almond Twist 
3       Opera Cake  Cheese Croissant         Orange Juice 
4     Truffle Cake        Apple Tart       Chocolate Tart 
5 Chocolate Eclair       Cherry Tart         Orange Juice 
6      Casino Cake   Strawberry Cake  Chocolate Croissant 
                 item4 item5 item6 item7 item8
1                 <NA>  <NA>  <NA>  <NA>  <NA>
2          Hot Coffee   <NA>  <NA>  <NA>  <NA>
3                 <NA>  <NA>  <NA>  <NA>  <NA>
4 Vanilla Frappuccino   <NA>  <NA>  <NA>  <NA>
5                 <NA>  <NA>  <NA>  <NA>  <NA>
6                 <NA>  <NA>  <NA>  <NA>  <NA>

1.3 Finding Frequent Itemsets

Apriori requires a transactions object. In order to create it, we write our modified data to a new CSV file and read it back as a transactions object using the arules package.

library(arules)
write.table(data, "receipts_new.csv", sep = ",", col.names = FALSE, row.names = FALSE, na = "")
t.data = read.transactions("receipts_new.csv", sep = ",")
summary(t.data)
transactions as itemMatrix in sparse format with
 75000 rows (elements/itemsets/transactions) and
 50 columns (items) and a density of 0.07098907 

most frequent items:
  Coffee Eclair      Hot Coffee    Tuile Cookie     Cherry Tart 
           8193            7700            7556            6987 
Strawberry Cake         (Other) 
           6948          228825 

element (itemset/transaction) length distribution:
sizes
    1     2     3     4     5     6     7     8 
 3592 13579 24674 17003  8640  3840  2191  1481 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   3.000   3.549   4.000   8.000 

includes extended item information - examples:
            labels
1 Almond Bear Claw
2 Almond Croissant
3      Almond Tart

The data comprises 75000 transactions and 50 different items. We can say that the data set is rather sparse with a density of only around 7%. The average transaction contains 3-4 items.

Next we will have a look at the top 10 most frequently purchased items and their relative frequency.

itemFrequencyPlot(t.data, topN = 10, type="relative", main="Top 10 Items")

The most popular item is the Coffee Eclair, followed by Hot Coffee. However, we can see that its relative frequency is only just above 0.10, which means that in general the items have low support values.
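If we want the exact numbers behind the plot rather than a visual estimate, a short sketch using the same `t.data` object could look like this (assuming arules is loaded):

```r
# Exact relative frequencies of the 10 most purchased items
top10 <- head(sort(itemFrequency(t.data, type = "relative"), decreasing = TRUE), 10)
round(top10, 4)
```

`itemFrequency()` returns the support of each individual item, which is also what `itemFrequencyPlot()` displays.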

So much for individual items. However, we are interested in finding frequent itemsets (combinations of items in the same transaction). We can use the Eclat algorithm to mine frequent itemsets. We will set the support value to 0.02 and maxlen (the maximum number of items in an itemset) to 3; we also need to set minlen to 2 in order to get itemsets of more than one item.

freq_itemsets = eclat(t.data, parameter = list(support = 0.02, minlen = 2, maxlen = 3))
Eclat

parameter specification:
 tidLists support minlen maxlen            target   ext
    FALSE    0.02      2      3 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

Absolute minimum support count: 1500 

create itemset ... 
set transactions ...[50 item(s), 75000 transaction(s)] done [0.03s].
sorting and recoding items ... [50 item(s)] done [0.00s].
creating sparse bit matrix ... [50 row(s), 75000 column(s)] done [0.01s].
writing  ... [66 set(s)] done [0.06s].
Creating S4 object  ... done [0.00s].
inspect(freq_itemsets)
     items                    support count
[1]  {Apple Croissant,                     
      Apple Tart,                          
      Cherry Soda}         0.02077333  1558
[2]  {Apple Croissant,                     
      Apple Danish,                        
      Cherry Soda}         0.02080000  1560
[3]  {Apple Danish,                        
      Apple Tart,                          
      Cherry Soda}         0.02073333  1555
[4]  {Apple Tart,                          
      Cherry Soda}         0.02278667  1709
[5]  {Apple Danish,                        
      Cherry Soda}         0.02313333  1735
[6]  {Apple Croissant,                     
      Cherry Soda}         0.02289333  1717
[7]  {Chocolate Tart,                      
      Vanilla Frappuccino,                 
      Walnut Cookie}       0.02676000  2007
[8]  {Vanilla Frappuccino,                 
      Walnut Cookie}       0.02848000  2136
[9]  {Chocolate Tart,                      
      Walnut Cookie}       0.02856000  2142
[10] {Blackberry Tart,                     
      Coffee Eclair,                       
      Single Espresso}     0.02720000  2040
[11] {Coffee Eclair,                       
      Single Espresso}     0.02949333  2212
[12] {Blackberry Tart,                     
      Single Espresso}     0.02946667  2210
[13] {Green Tea,                           
      Lemon Cookie,                        
      Lemon Lemonade}      0.02085333  1564
[14] {Green Tea,                           
      Lemon Cookie,                        
      Raspberry Lemonade}  0.02093333  1570
[15] {Green Tea,                           
      Lemon Cookie,                        
      Raspberry Cookie}    0.02077333  1558
[16] {Green Tea,                           
      Lemon Lemonade,                      
      Raspberry Cookie}    0.02086667  1565
[17] {Green Tea,                           
      Raspberry Cookie,                    
      Raspberry Lemonade}  0.02085333  1564
[18] {Green Tea,                           
      Lemon Lemonade,                      
      Raspberry Lemonade}  0.02080000  1560
[19] {Green Tea,                           
      Lemon Lemonade}      0.02302667  1727
[20] {Green Tea,                           
      Raspberry Lemonade}  0.02292000  1719
[21] {Green Tea,                           
      Raspberry Cookie}    0.02314667  1736
[22] {Green Tea,                           
      Lemon Cookie}        0.02308000  1731
[23] {Apple Croissant,                     
      Apple Danish,                        
      Apple Tart}          0.02550667  1913
[24] {Apple Croissant,                     
      Apple Tart}          0.02784000  2088
[25] {Apple Croissant,                     
      Apple Danish}        0.02789333  2092
[26] {Apple Danish,                        
      Apple Tart}          0.02785333  2089
[27] {Berry Tart,                          
      Bottled Water}       0.03780000  2835
[28] {Chocolate Tart,                      
      Vanilla Frappuccino} 0.03596000  2697
[29] {Lemon Cake,                          
      Lemon Tart}          0.03685333  2764
[30] {Casino Cake,                         
      Chocolate Cake,                      
      Chocolate Coffee}    0.03338667  2504
[31] {Casino Cake,                         
      Chocolate Cake}      0.03553333  2665
[32] {Casino Cake,                         
      Chocolate Coffee}    0.03524000  2643
[33] {Blackberry Tart,                     
      Coffee Eclair}       0.03641333  2731
[34] {Lemon Cookie,                        
      Lemon Lemonade,                      
      Raspberry Cookie}    0.02576000  1932
[35] {Lemon Cookie,                        
      Raspberry Cookie,                    
      Raspberry Lemonade}  0.02568000  1926
[36] {Lemon Cookie,                        
      Lemon Lemonade,                      
      Raspberry Lemonade}  0.02562667  1922
[37] {Lemon Cookie,                        
      Lemon Lemonade}      0.02781333  2086
[38] {Lemon Cookie,                        
      Raspberry Lemonade}  0.02782667  2087
[39] {Lemon Cookie,                        
      Raspberry Cookie}    0.02782667  2087
[40] {Lemon Lemonade,                      
      Raspberry Cookie,                    
      Raspberry Lemonade}  0.02574667  1931
[41] {Lemon Lemonade,                      
      Raspberry Cookie}    0.02801333  2101
[42] {Raspberry Cookie,                    
      Raspberry Lemonade}  0.02772000  2079
[43] {Lemon Lemonade,                      
      Raspberry Lemonade}  0.02784000  2088
[44] {Cheese Croissant,                    
      Orange Juice}        0.04306667  3230
[45] {Gongolais Cookie,                    
      Truffle Cake}        0.04392000  3294
[46] {Napoleon Cake,                       
      Strawberry Cake}     0.04314667  3236
[47] {Almond Twist,                        
      Apple Pie,                           
      Coffee Eclair}       0.03432000  2574
[48] {Almond Twist,                        
      Apple Pie,                           
      Hot Coffee}          0.02805333  2104
[49] {Apple Pie,                           
      Coffee Eclair,                       
      Hot Coffee}          0.02809333  2107
[50] {Apple Pie,                           
      Coffee Eclair}       0.03726667  2795
[51] {Apple Pie,                           
      Hot Coffee}          0.03102667  2327
[52] {Almond Twist,                        
      Apple Pie}           0.03668000  2751
[53] {Almond Twist,                        
      Coffee Eclair,                       
      Hot Coffee}          0.02812000  2109
[54] {Almond Twist,                        
      Coffee Eclair}       0.03712000  2784
[55] {Almond Twist,                        
      Hot Coffee}          0.03092000  2319
[56] {Apricot Croissant,                   
      Blueberry Tart,                      
      Hot Coffee}          0.03282667  2462
[57] {Blueberry Tart,                      
      Hot Coffee}          0.03504000  2628
[58] {Apricot Croissant,                   
      Blueberry Tart}      0.04350667  3263
[59] {Chocolate Cake,                      
      Chocolate Coffee}    0.04404000  3303
[60] {Apricot Danish,                      
      Cherry Tart,                         
      Opera Cake}          0.04110667  3083
[61] {Cherry Tart,                         
      Opera Cake}          0.04337333  3253
[62] {Apricot Danish,                      
      Opera Cake}          0.04302667  3227
[63] {Apricot Croissant,                   
      Hot Coffee}          0.03537333  2653
[64] {Marzipan Cookie,                     
      Tuile Cookie}        0.05092000  3819
[65] {Apricot Danish,                      
      Cherry Tart}         0.05309333  3982
[66] {Coffee Eclair,                       
      Hot Coffee}          0.03156000  2367

1.4 Mining Strong Association Rules

Now we want to find strong association rules using the Apriori algorithm. Because we want strong rules, we will set the minimum threshold for the confidence indicator at 80%, meaning that whenever item X was purchased, item Y was also purchased 80% of the time. We will set the minimum support value to 0.01, as we have seen that the relative purchase frequency of the items is quite low.

rules = apriori(t.data, parameter = list(supp = 0.01, conf = 0.8))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.8    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 750 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[50 item(s), 75000 transaction(s)] done [0.03s].
sorting and recoding items ... [50 item(s)] done [0.00s].
creating transaction tree ... done [0.04s].
checking subsets of size 1 2 3 4 5 done [0.01s].
writing ... [85 rule(s)] done [0.00s].
creating S4 object  ... done [0.02s].
summary(rules)
set of 85 rules

rule length distribution (lhs + rhs):sizes
 3  4  5 
52 28  5 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   3.000   3.000   3.447   4.000   5.000 

summary of quality measures:
    support          confidence          lift            count     
 Min.   :0.02059   Min.   :0.8049   Min.   : 7.924   Min.   :1544  
 1st Qu.:0.02073   1st Qu.:0.9070   1st Qu.:12.873   1st Qu.:1555  
 Median :0.02093   Median :0.9229   Median :13.403   Median :1570  
 Mean   :0.02454   Mean   :0.9319   Mean   :13.035   Mean   :1840  
 3rd Qu.:0.02676   3rd Qu.:0.9922   3rd Qu.:14.514   3rd Qu.:2007  
 Max.   :0.04111   Max.   :1.0000   Max.   :14.784   Max.   :3083  

mining info:
   data ntransactions support confidence
 t.data         75000    0.01        0.8

In total 85 rules with a confidence value of 0.8 or higher were generated.

1.4.1 Redundant Rules

Rules having the same support and confidence as more general rules are considered redundant association rules. A rule is more general than another, more specific rule if the specific rule can be generated by adding additional items to either the antecedent or consequent of the general rule.

Let’s see if we have any redundant rules.

sum(is.redundant(rules, measure = "lift"))
[1] 0

Seems like there are no redundant rules.

1.4.2 Rules by Confidence

Let’s have a look at the 15 rules with the highest confidence.

rules_by_conf = sort(rules, by="confidence", decreasing=TRUE)
inspect(rules_by_conf[1:15,]) 
     lhs                     rhs                     support confidence      lift count
[1]  {Green Tea,                                                                       
      Lemon Cookie,                                                                    
      Raspberry Cookie,                                                                
      Raspberry Lemonade} => {Lemon Lemonade}     0.02073333  1.0000000 14.654162  1555
[2]  {Green Tea,                                                                       
      Lemon Cookie,                                                                    
      Lemon Lemonade,                                                                  
      Raspberry Cookie}   => {Raspberry Lemonade} 0.02073333  1.0000000 14.760874  1555
[3]  {Green Tea,                                                                       
      Lemon Cookie,                                                                    
      Lemon Lemonade,                                                                  
      Raspberry Lemonade} => {Raspberry Cookie}   0.02073333  1.0000000 14.784151  1555
[4]  {Green Tea,                                                                       
      Lemon Lemonade,                                                                  
      Raspberry Cookie,                                                                
      Raspberry Lemonade} => {Lemon Cookie}       0.02073333  0.9993573 14.693550  1555
[5]  {Green Tea,                                                                       
      Lemon Cookie,                                                                    
      Raspberry Cookie}   => {Raspberry Lemonade} 0.02073333  0.9980745 14.732451  1555
[6]  {Green Tea,                                                                       
      Lemon Cookie,                                                                    
      Raspberry Cookie}   => {Lemon Lemonade}     0.02073333  0.9980745 14.625945  1555
[7]  {Green Tea,                                                                       
      Lemon Lemonade,                                                                  
      Raspberry Lemonade} => {Raspberry Cookie}   0.02074667  0.9974359 14.746243  1556
[8]  {Lemon Cookie,                                                                    
      Lemon Lemonade,                                                                  
      Raspberry Lemonade} => {Raspberry Cookie}   0.02556000  0.9973985 14.745691  1917
[9]  {Green Tea,                                                                       
      Lemon Lemonade,                                                                  
      Raspberry Lemonade} => {Lemon Cookie}       0.02073333  0.9967949 14.655874  1555
[10] {Lemon Cookie,                                                                    
      Raspberry Cookie,                                                                
      Raspberry Lemonade} => {Lemon Lemonade}     0.02556000  0.9953271 14.585684  1917
[11] {Almond Twist,                                                                    
      Apple Pie,                                                                       
      Hot Coffee}         => {Coffee Eclair}      0.02792000  0.9952471  9.110648  2094
[12] {Green Tea,                                                                       
      Raspberry Cookie,                                                                
      Raspberry Lemonade} => {Lemon Lemonade}     0.02074667  0.9948849 14.579204  1556
[13] {Green Tea,                                                                       
      Lemon Lemonade,                                                                  
      Raspberry Cookie}   => {Raspberry Lemonade} 0.02074667  0.9942492 14.675987  1556
[14] {Green Tea,                                                                       
      Raspberry Cookie,                                                                
      Raspberry Lemonade} => {Lemon Cookie}       0.02073333  0.9942455 14.618391  1555
[15] {Green Tea,                                                                       
      Lemon Cookie,                                                                    
      Lemon Lemonade}     => {Raspberry Cookie}   0.02073333  0.9942455 14.699076  1555

We can see that all 15 rules have a confidence of 1 or very close to 1, which means that whenever the antecedent items were purchased, the consequent item was also purchased in (close to) 100% of the cases. The rules also have very high lift values, meaning that the correlation between the items in each rule is very strong.

What is also interesting is that all 15 rules are combinations of the same items, which are: Lemon Lemonade, Raspberry Lemonade, Lemon Cookie, Raspberry Cookie and Green Tea. The antecedents and consequents seem to switch places (a -> b and b -> a) to form new rules with only minor variations in support, confidence and lift.

So this is definitely an important itemset for the bakery owner to consider.

Let’s visualize the outcome.

plot((rules_by_conf)[1:15,], method = "graph")

1.4.3 Rules by Support

Next let’s take a look at the 10 rules with highest support.

rules_by_sup = sort(rules, by="support", decreasing=TRUE)
inspect(rules_by_sup[1:10,]) 
     lhs                               rhs                 support   
[1]  {Apricot Danish,Opera Cake}    => {Cherry Tart}       0.04110667
[2]  {Cherry Tart,Opera Cake}       => {Apricot Danish}    0.04110667
[3]  {Almond Twist,Apple Pie}       => {Coffee Eclair}     0.03432000
[4]  {Apple Pie,Coffee Eclair}      => {Almond Twist}      0.03432000
[5]  {Almond Twist,Coffee Eclair}   => {Apple Pie}         0.03432000
[6]  {Casino Cake,Chocolate Coffee} => {Chocolate Cake}    0.03338667
[7]  {Casino Cake,Chocolate Cake}   => {Chocolate Coffee}  0.03338667
[8]  {Blueberry Tart,Hot Coffee}    => {Apricot Croissant} 0.03282667
[9]  {Apricot Croissant,Hot Coffee} => {Blueberry Tart}    0.03282667
[10] {Almond Twist,Hot Coffee}      => {Coffee Eclair}     0.02812000
     confidence lift      count
[1]  0.9553765  10.255222 3083 
[2]  0.9477405  10.237727 3083 
[3]  0.9356598   8.565175 2574 
[4]  0.9209302  11.929148 2574 
[5]  0.9245690  11.988705 2574 
[6]  0.9474082  11.341679 2504 
[7]  0.9395872  11.300360 2504 
[8]  0.9368341  11.154557 2462 
[9]  0.9280060  11.187985 2462 
[10] 0.9094437   8.325190 2109 

Let’s again visualize the outcome.

plot(rules_by_sup[1:10,], method = "graph")

Here we have the following itemsets that could be interesting to the bakery due to their high support, again appearing in several rules with changing antecedents and consequents.

  • Apricot Danish, Opera cake, Cherry Tart
  • Almond Twist, Apple Pie, Coffee Eclair (and Hot Coffee)
  • Casino Cake, Chocolate Coffee, Chocolate Cake
  • Apricot Croissant, Hot Coffee, Blueberry Tart

Notice also that the Almond Twist, Apple Pie and Coffee Eclair itemset also appeared in the graph above.

1.5 Conclusion

Given what we have seen in our analysis so far, we can say that in order to generate more sales it could be a good idea to focus on the itemsets identified through association rule mining, perhaps offering these items in bundles or as special offers.

2 Part 2: Anomaly detection

2.1 Introduction

In this part the objective is to detect anomalies, i.e. patterns that do not conform to “normal” behavior using the Wisconsin Breast Cancer dataset.

There are basically three different approaches to outlier detection: statistical methods, distance/density methods and clustering methods. I decided to compare the following methods:

  • Mahalanobis distance (distance-based)
  • LOF Algorithm (density-based)
  • DBSCAN (clustering-based)
  • Expectation Maximisation (clustering-based)

2.2 Dataset Description and Exploration

The dataset was downloaded from the UCI machine learning repository. According to the dataset description, the dataset includes 699 examples of cancer biopsies with 11 features. One feature is an identification number, another is the cancer diagnosis, which is coded as 4 to indicate malignant or 2 to indicate benign. The other 9 are the following numeric-valued laboratory measurements:

  • Clump Thickness
  • Uniformity of Cell Size
  • Uniformity of Cell Shape
  • Marginal Adhesion
  • Single Epithelial Cell Size
  • Bare Nuclei
  • Bland Chromatin
  • Normal Nucleoli
  • Mitoses

The description also states that there are 16 missing attribute values denoted as “?”.

We load the data, naming the columns appropriately, encode the class labels as binary (1 - malignant, 0 - benign), read the “?” values as NAs, and remove the ID column.

data2 = read.table("breast-cancer-wisconsin.data", sep = ",", na.strings='?', stringsAsFactors=F, header=F,
                  col.names = c("id", "clump_thickness", "uniform_cell_size", "uniform_cell_shape", 
                 "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei", 
                 "bland_chromatin", "normal_nucleoli", "mitoses", "class"))
data2$class <- as.factor(ifelse(data2$class == 4, 1, 0))
data2$id <- NULL

summary(data2)
 clump_thickness  uniform_cell_size uniform_cell_shape marginal_adhesion
 Min.   : 1.000   Min.   : 1.000    Min.   : 1.000     Min.   : 1.000   
 1st Qu.: 2.000   1st Qu.: 1.000    1st Qu.: 1.000     1st Qu.: 1.000   
 Median : 4.000   Median : 1.000    Median : 1.000     Median : 1.000   
 Mean   : 4.418   Mean   : 3.134    Mean   : 3.207     Mean   : 2.807   
 3rd Qu.: 6.000   3rd Qu.: 5.000    3rd Qu.: 5.000     3rd Qu.: 4.000   
 Max.   :10.000   Max.   :10.000    Max.   :10.000     Max.   :10.000   
                                                                        
 single_epithelial_cell_size  bare_nuclei     bland_chromatin 
 Min.   : 1.000              Min.   : 1.000   Min.   : 1.000  
 1st Qu.: 2.000              1st Qu.: 1.000   1st Qu.: 2.000  
 Median : 2.000              Median : 1.000   Median : 3.000  
 Mean   : 3.216              Mean   : 3.545   Mean   : 3.438  
 3rd Qu.: 4.000              3rd Qu.: 6.000   3rd Qu.: 5.000  
 Max.   :10.000              Max.   :10.000   Max.   :10.000  
                             NA's   :16                       
 normal_nucleoli     mitoses       class  
 Min.   : 1.000   Min.   : 1.000   0:458  
 1st Qu.: 1.000   1st Qu.: 1.000   1:241  
 Median : 1.000   Median : 1.000          
 Mean   : 2.867   Mean   : 1.589          
 3rd Qu.: 4.000   3rd Qu.: 1.000          
 Max.   :10.000   Max.   :10.000          
                                          

We have only numerical features in the dataset. We can see that the missing values were all in the column bare_nuclei; therefore we can eliminate those observations without losing too much data.

library(tidyr)
data2 = drop_na(data2)

There is no further data preparation necessary, as all our variables are already numeric and we no longer have any NAs.

To start with an initial general exploration of the whole dataset, let’s first check some plots of the features.

library(ggplot2)
ggplot(stack(data2[,1:9]), aes(x = ind, y = values)) + geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust=1)) +
  labs(title = "Boxplots of columns") + labs(x = "", y = "Values") + 
  scale_y_continuous(breaks = seq(1, 10, by = 1))

We can notice that several columns have outliers, with the column Mitoses being the most critical.

pairs(~., data = data2[,1:9], 
      main="Scatterplot Matrix")

From this plot we cannot deduce any outliers. We will see if we can get a better visualization using principal components analysis to create a simple two-dimensional mapping of the data.

# turning class variable into numeric
data2$class = as.numeric(as.character(data2$class))
# scaling data
scaled = as.data.frame(scale(data2))
# Create principal components
data.pca = prcomp(scaled)
summary(data.pca)
Importance of components:
                          PC1     PC2    PC3     PC4     PC5     PC6
Standard deviation     2.5945 0.89059 0.7389 0.68213 0.61675 0.55906
Proportion of Variance 0.6731 0.07932 0.0546 0.04653 0.03804 0.03125
Cumulative Proportion  0.6731 0.75243 0.8070 0.85356 0.89160 0.92285
                          PC7     PC8     PC9    PC10
Standard deviation     0.5441 0.51110 0.35545 0.29646
Proportion of Variance 0.0296 0.02612 0.01263 0.00879
Cumulative Proportion  0.9525 0.97858 0.99121 1.00000

We have obtained 10 principal components. Each one explains a percentage of the total variation in the dataset. We see that PC1 explains 67% of the total variance and PC2 around 8%. So, by knowing the position of a sample in relation to PC1 and PC2, we can get a view on where it stands in relation to other samples, as PC1 and PC2 together explain around 75% of the variance.
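The proportions reported by `summary()` can be recomputed directly from the component standard deviations stored in the prcomp object, which makes the relationship between the two explicit:

```r
# Variance explained per component = sdev^2 / total variance
var_explained <- data.pca$sdev^2 / sum(data.pca$sdev^2)
round(var_explained[1:2], 4)   # proportions for PC1 and PC2
sum(var_explained[1:2])        # cumulative proportion, ~0.75
```

This matches the "Proportion of Variance" and "Cumulative Proportion" rows in the summary above.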

library(data.table)
data.plot <- data.table(data.pca$x[, 1:2])
data.plot[, entry := rownames(data2)]
data.plot[, class := data2$class]

ggplot(data.plot, aes(x = PC1, y = PC2)) +
        geom_point(aes(colour = factor(class)), size = 5, alpha = 0.3) +
        geom_text(aes(label = entry), check_overlap = TRUE) +
        ggtitle("Data Distribution and Diagnosis") +
        theme_minimal()

Let’s also get a three-dimensional view of the data by adding the 3rd PC.

library(plotly)
data.plot3 <- data.table(data.pca$x[, 1:3])
data.plot3[, entry := rownames(data2)]
data.plot3[, class := data2$class]

data.plot3$class = as.factor(data.plot3$class)

plot_ly(data.plot3, x = data.plot3$PC1, y = data.plot3$PC2, z = data.plot3$PC3, color = data.plot3$class, colors = c('#BF382A', '#0C4B8E'), alpha = 0.8)   %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                     yaxis = list(title = 'PC2'),
                     zaxis = list(title = 'PC3')))

In general the two classes seem to be fairly well separated. However, there is a small area where both classes are mixed, which is probably where we can find some outliers. The cases 690, 164 and 9 could also be anomalies within the group of non-cancer patients.

2.3 Mahalanobis Distance

The Mahalanobis distance is a measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. Mahalanobis distance can be used to determine multivariate outliers, where a point that has a greater Mahalanobis distance from the rest of the sample population of points is considered an outlier.
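The underlying computation can be sketched in base R with `mahalanobis()`, which returns squared distances. Note that the cutoff used here (the conventional 97.5% chi-squared quantile) is an illustrative assumption and not the adjusted-quantile method the MVN package applies below:

```r
# Squared Mahalanobis distance of every observation to the data centre;
# under multivariate normality these follow a chi-squared distribution
# with p degrees of freedom
p  <- ncol(scaled)
d2 <- mahalanobis(scaled, colMeans(scaled), cov(scaled))

# Flag points beyond the 97.5% chi-squared quantile (illustrative cutoff)
cutoff <- qchisq(0.975, df = p)
outliers_base <- which(d2 > cutoff)
length(outliers_base)
```

Different cutoffs (or robust estimates of the centre and covariance) can change the set of flagged points considerably.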

To detect multivariate outliers we use the MVN package, which flags observations whose Mahalanobis distance exceeds an adjusted chi-squared quantile (method “quan”).

library(MVN)
mah <- MVN::mvn(data = scaled, mvnTest = "hz",
               univariateTest = "AD", univariatePlot = "box", 
               multivariatePlot = "qq", multivariateOutlierMethod = "quan",
               showOutliers = TRUE)

According to this method almost half of the observations are classified as outliers (n = 325). Let’s visualize the outliers.

outliers.mah = data.table(obs = rownames(mah$multivariateOutliers),
                          MahalanobisOutlier = mah$multivariateOutliers$Outlier)

ggplot() +
  geom_point(data = data.plot, aes(x = PC1, y = PC2), size = 3, color = "cyan3", alpha = 0.5) +
  geom_point(data = data.plot[as.numeric(outliers.mah$obs)], aes(x = PC1, y = PC2), size = 3, color = "coral2", alpha = 0.3) + 
  theme_minimal() 

This result obviously does not make much sense, as the outliers are scattered all over the place. We will see whether we get better results with the next method.

2.4 LOF Algorithm

LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (with an LOF value greater than one), the point is in a sparser region than its neighbors, which suggests it is an outlier.
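The intuition can be sketched in Python on made-up 1-D data. Note that this simplified version uses plain k-nearest-neighbour densities and omits the reachability-distance smoothing of the full LOF definition:

```python
def knn_dists(data, i, k):
    # sorted distances from point i to its k nearest neighbours
    d = sorted(abs(data[i] - data[j]) for j in range(len(data)) if j != i)
    return d[:k]

def local_density(data, i, k):
    # inverse of the mean distance to the k nearest neighbours
    return 1.0 / (sum(knn_dists(data, i, k)) / k)

def lof_score(data, i, k):
    # indices of the k nearest neighbours of point i
    dists = sorted((abs(data[i] - data[j]), j)
                   for j in range(len(data)) if j != i)
    neighbours = [j for _, j in dists[:k]]
    # LOF ~ mean neighbour density / own density
    mean_nb = sum(local_density(data, j, k) for j in neighbours) / k
    return mean_nb / local_density(data, i, k)

data = [1.0, 1.1, 1.2, 1.3, 10.0]
print(lof_score(data, 4, k=3))  # isolated point: score far above 1
print(lof_score(data, 1, k=3))  # point inside the cluster: score near 1
```

Points inside a cluster have roughly the same density as their neighbours (ratio near 1), while the isolated point's density is a tiny fraction of its neighbours', giving a large LOF score.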

library(DMwR)

outlier.scores = lofactor(scaled, k=10)

# Pick top 20% as outliers
n = as.integer(310*0.2)
outliers.lof = order(outlier.scores, decreasing=T)[1:n]
print(outliers.lof)
 [1]  12  14  26  28  30  34  35  44  60  68  74  90 107 109 119 132 133
[18] 142 144 153 215 237 243 279 352 361 377 385 391 396 415 416 429 455
[35] 456 472 481 491 493 499 504 506 511 520 534 544 631 632 661 672 675
[52] 678 679   9  37  82 128 674 159 541 242 287

Let’s visualize the outliers.

ggplot() +
  geom_point(data = data.plot, aes(x = PC1, y = PC2), size = 3, color = "cyan3", alpha = 0.5) +
  geom_point(data = data.plot[as.numeric(outliers.lof)], aes(x = PC1, y = PC2), size = 3, color = "coral2", alpha = 0.3) + theme_minimal() 

This algorithm detects outliers that are much more concentrated in the area of benign cases.

2.5 DBSCAN Algorithm

DBSCAN is a density-based clustering algorithm. It works by defining "core points" given two user-defined parameters: the radius to search around a point (eps) and the minimum number of neighbouring points required for an observation to count as a "core point" (minPts). Core points define clusters, and points that lie outside all clusters are labelled as outliers (noise).
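The core-point rule can be sketched in a few lines of Python on made-up 1-D data (the real DBSCAN additionally expands clusters outward from the core points):

```python
def core_points(data, eps, min_pts):
    """Indices of points with at least min_pts neighbours
    (counting the point itself) within radius eps."""
    cores = []
    for i, p in enumerate(data):
        n = sum(1 for q in data if abs(p - q) <= eps)
        if n >= min_pts:
            cores.append(i)
    return cores

data = [0.0, 0.5, 1.0, 1.5, 9.0]
# the four clustered points are core points; the point at 9.0 is not
print(core_points(data, eps=1.0, min_pts=3))  # [0, 1, 2, 3]
```

A point that is neither a core point nor within eps of one ends up in "cluster 0" in the dbscan package's output, which is exactly how the 55 noise points above were identified.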

Let’s run the DBSCAN for our dataset.

library(dbscan)

dbsc = dbscan(scaled, eps = 2, minPts = 4)
dbsc
DBSCAN clustering for 683 objects.
Parameters: eps = 2, minPts = 4
The clustering contains 2 cluster(s) and 55 noise points.

  0   1   2 
 55 438 190 

Available fields: cluster, eps, minPts

The algorithm obtained two clusters and identified 55 outliers (cluster 0).

Let’s visualize the result.

data.plot[, Dclusters := dbsc$cluster]

ggplot(data.plot, aes(x = PC1, y = PC2)) +
  geom_point(aes(colour = factor(Dclusters)), size = 3, alpha = 0.5) +
  theme_minimal()

The outliers detected by this method are very different from those detected by the LOF algorithm. We now see many more outliers scattered around the space of malignant cases.

2.6 Expectation Maximization

Expectation Maximization (EM) is an unsupervised clustering algorithm that fits a mixture of Gaussian components to the data, alternating between estimating each observation's component membership probabilities (E-step) and re-estimating the component parameters (M-step).
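As a toy illustration of the E- and M-steps (not what Mclust does internally; Mclust fits full multivariate Gaussian mixtures), here is a Python sketch for a 1-D, two-component mixture with fixed unit variances and equal weights; the data and starting means are made up:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(data, mu1, mu2, iters=50):
    """EM for a two-component 1-D Gaussian mixture with
    fixed unit variances and equal mixing weights (toy version)."""
    var = 1.0
    for _ in range(iters):
        # E-step: responsibility of component 2 for each point
        r = [normal_pdf(x, mu2, var) /
             (normal_pdf(x, mu1, var) + normal_pdf(x, mu2, var))
             for x in data]
        # M-step: re-estimate the means from the weighted points
        mu1 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
        mu2 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    return mu1, mu2

data = [0.1, -0.2, 0.3, 9.8, 10.1, 10.3]
mu1, mu2 = em_1d(data, 0.0, 5.0)
print(mu1, mu2)  # the means converge to the two clusters near 0 and 10
```

The soft membership probabilities from the E-step are what Mclust returns as the classification; observations with ambiguous memberships sit between the components, which is where we might expect anomalies.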

library(mclust)

emax = Mclust(scaled, G = 3)

data.plot[, EMclusters := emax$classification]
data.plot
           PC1         PC2 entry class Dclusters EMclusters
  1:  1.632138 -0.10110899     1     0         1          1
  2: -1.091952 -0.36764796     2     0         0          2
  3:  1.747215 -0.06835623     3     0         1          1
  4: -1.123580 -0.30561252     4     0         0          2
  5:  1.517047 -0.06200488     5     0         1          1
 ---                                                       
679:  1.871012  0.17679452   695     0         1          1
680:  2.199791  0.21113651   696     0         1          3
681: -4.014759 -0.02066566   697     1         2          2
682: -2.577111 -1.07891940   698     1         2          2
683: -2.943971 -1.14444923   699     1         2          2

Again we will visualize the result.

ggplot(data.plot, aes(x = PC1, y = PC2)) +
  geom_point(aes(colour = factor(EMclusters)), size = 3, alpha = 0.3) +
  theme_minimal()

The EM algorithm again detects very different outliers from the previous methods. Here the outliers are concentrated between the regions of malignant and benign cases.

2.7 Conclusion

Given the very different outcomes of the four methods tested on this dataset, I think the best result was achieved with the EM method, as it detected outliers where we would expect to find them: in the transition region between the benign and malignant cases.